Cross Validation

Statistics for Data Science II

Introduction

  • We are often interested in how well our model performs with “real” data.

  • We cannot honestly assess a model's performance using the same data that was used to construct it; the resulting error estimate is optimistically biased.

  • In this lecture, we will learn about the validation set approach, leave-one-out cross-validation, and k-fold cross-validation.

  • Consider the penguin dataset as a basic example. Let us consider modeling penguin body mass (g) as a function of flipper length (mm).

    • M1: body mass \sim flipper

    • M2: body mass \sim flipper + flipper^2

Introduction

m1 <- glm(body_mass_g ~ flipper_length_mm, data = data)
summary(m1)

Call:
glm(formula = body_mass_g ~ flipper_length_mm, data = data)

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -5872.09     310.29  -18.93   <2e-16 ***
flipper_length_mm    50.15       1.54   32.56   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 154718.9)

    Null deviance: 215259666  on 332  degrees of freedom
Residual deviance:  51211963  on 331  degrees of freedom
AIC: 4928.1

Number of Fisher Scoring iterations: 2
data <- data %>% mutate(flipper2 = flipper_length_mm^2)
m2 <- glm(body_mass_g ~ flipper_length_mm + flipper2, data = data)
summary(m2)

Call:
glm(formula = body_mass_g ~ flipper_length_mm + flipper2, data = data)

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       16579.8692  4616.2466   3.592 0.000379 ***
flipper_length_mm  -171.6140    45.5245  -3.770 0.000194 ***
flipper2              0.5449     0.1118   4.874  1.7e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 144766.4)

    Null deviance: 215259666  on 332  degrees of freedom
Residual deviance:  47772915  on 330  degrees of freedom
AIC: 4907

Number of Fisher Scoring iterations: 2

Validation Set Approach

  • Cross validation starts by splitting data into training and validation datasets.
  • We will use:

    • the training data to construct the model

    • the validation data to determine how good the prediction of the model is.

  • Why do we need a separate (validation) dataset?

    • We need data that was not used to fit the model, so that the error estimate is not optimistically biased.

Validation Set Approach

  • We will use the mean square error (MSE) to assess how well the model performs.

    • squared error = (y_i - \hat{y}_i)^2

    • MSE = average of squared error for all observations in the validation set

  • Because the estimate depends on the particular split, we may construct multiple training and validation sets.

    • When comparing several candidate models, we choose the model with the lowest validation MSE.
  • We will use the tidymodels package to split the data.
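Before turning to the R code, the mechanics of the approach can be sketched in a few lines. The example below uses Python with synthetic data (not the penguin data) to mimic the linear-vs-quadratic comparison: split the observations in half, fit each candidate model on the training half, and compare validation MSEs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a genuinely quadratic trend (x is centered
# to keep the polynomial fits well conditioned).
n = 200
x = rng.uniform(-30, 30, n)
y = 3.0 * x + 0.5 * x**2 + rng.normal(0, 20, n)

# Randomly split the observations 50/50 into training and validation.
idx = rng.permutation(n)
train, valid = idx[:n // 2], idx[n // 2:]

def validation_mse(degree):
    # Fit a polynomial of the given degree on the training half,
    # then average the squared errors on the validation half.
    coefs = np.polyfit(x[train], y[train], degree)
    yhat = np.polyval(coefs, x[valid])
    return np.mean((y[valid] - yhat) ** 2)

mse_m1 = validation_mse(1)  # linear model (analogue of M1)
mse_m2 = validation_mse(2)  # quadratic model (analogue of M2)
```

Because the underlying trend really is quadratic here, `mse_m2` comes out well below `mse_m1`, which is exactly the comparison the validation set approach formalizes.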

Validation Set Approach

data <- data %>%
  mutate(obs = row_number()) %>%
  relocate(obs)
head(data)
library(tidymodels) # provides initial_split(), training(), testing()
set.seed(906282) # reproducible sampling
split <- initial_split(data, prop = 0.5)
training <- training(split)
validation <- testing(split)
head(training)

Validation Set Approach

  • Now, we will construct our models using the training dataset.
m1v <- lm(body_mass_g ~ flipper_length_mm, data = training)
m2v <- lm(body_mass_g ~ flipper_length_mm + flipper2, data = training)
  • We now need to compute the MSE for each model using the validation dataset.

    • First, we will set up the squared errors, (y_i - \hat{y}_i)^2, under each model.

    • Then, we take the average of the squared error to find the MSE for each model.

Validation Set Approach

validation <- validation %>% 
  mutate(yhat_m1 = predict(m1v, newdata = validation),
         yhat_m2 = predict(m2v, newdata = validation)) %>%
  mutate(sqerr_m1 = (yhat_m1 - body_mass_g)^2,
         sqerr_m2 = (yhat_m2 - body_mass_g)^2) %>%
  relocate(obs, body_mass_g, yhat_m1, sqerr_m1, yhat_m2, sqerr_m2)
head(validation)

Validation Set Approach

  • Looking at the MSE,
mean(validation$sqerr_m1, na.rm = TRUE)
[1] 171215.7
mean(validation$sqerr_m2, na.rm = TRUE)
[1] 153882.4
  • Of the two candidate models, the model including the quadratic term (M2) gives the lowest MSE.

Leave-One-Out Cross-Validation

  • Leave-one-out cross-validation (LOOCV) follows the same logic as the validation set approach.

  • Instead of holding out a large validation set, we leave out a single observation, fit the model on the remaining n - 1 observations, and predict the observation that was left out.

Leave-One-Out Cross-Validation

  • Repeating this for every observation i = 1, \ldots, n gives n single-observation MSEs: \text{MSE}_i = (y_i - \hat{y}_i)^2

  • Because each \text{MSE}_i is based on a single observation, it is a highly variable estimate of the test error.

  • Thus, we then consider the leave-one-out cross-validation estimate for the test MSE, \text{CV}_{(n)} = \frac{\sum_{i=1}^n \text{MSE}_i}{n}

  • These are useful when considering various models in terms of what is being included as predictors (e.g., higher order polynomial terms).
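A brute-force sketch of LOOCV (in Python, on synthetic data rather than the lecture's penguin data) makes the definition concrete: fit the model n times, each time leaving out one observation, and average the n squared prediction errors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = rng.uniform(0, 10, n)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, n)

def loocv_mse(degree):
    # Fit the model n times; on pass i, observation i is held out
    # and its squared prediction error is MSE_i.
    errs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        coefs = np.polyfit(x[mask], y[mask], degree)
        errs[i] = (y[i] - np.polyval(coefs, x[i])) ** 2
    # CV_(n) is the average of the n single-observation MSEs.
    return errs.mean()

cv_n = loocv_mse(1)
```

With noise standard deviation 1, `cv_n` lands near 1, the irreducible error; the n refits are what make this approach expensive for large n.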

Leave-One-Out Cross-Validation

  • Note that as n grows and as the model becomes more complex, refitting the model n times becomes computationally expensive.

  • If using least squares regression, we can avoid the n refits with the shortcut \text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \left( \frac{y_i - \hat{y}_i}{1-h_i} \right)^2,

  • where \hat{y}_i is the fitted value from the single fit to the full data and h_i is the leverage as defined in the lecture on model assumptions and diagnostics.
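For least squares this shortcut is not an approximation: it reproduces the brute-force LOOCV result exactly, which we can verify numerically. The sketch below (Python, synthetic data) computes the leverages from the hat matrix and compares the shortcut against the leave-one-out loop.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)
X = np.column_stack([np.ones(n), x])  # design matrix with intercept

# Single least-squares fit on the full data, and its fitted values.
beta = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ beta

# Leverages h_i: the diagonal of the hat matrix X (X'X)^{-1} X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Shortcut: CV_(n) = (1/n) * sum(((y_i - yhat_i) / (1 - h_i))^2).
cv_shortcut = np.mean(((y - yhat) / (1 - h)) ** 2)

# Brute force: refit with each observation left out in turn.
errs = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    b = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs[i] = (y[i] - X[i] @ b) ** 2
cv_brute = errs.mean()
```

The two quantities agree to floating-point precision, so only one model fit is ever needed for LOOCV under least squares.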

Leave-One-Out Cross-Validation

  • In our example,
library(boot) # provides cv.glm()
set.seed(6714)
m1 <- glm(body_mass_g ~ flipper_length_mm, data = data)
cv.glm(data, m1)$delta
[1] 155563.6 155560.9
set.seed(19571)
m2 <- glm(body_mass_g ~ flipper_length_mm + flipper2, data = data)
cv.glm(data, m2)$delta
[1] 145759.7 145756.3
  • The first value is the raw cross-validation estimate of CV_{(n)}; the second is a bias-adjusted version of it.

  • The model with the quadratic term has a lower CV_{(n)} and thus predicts better.

k-Fold Cross-Validation

  • An alternative to leave-one-out cross-validation is the k-fold cross-validation.

  • Instead of leaving a single observation out, we now leave a group of observations out.

k-Fold Cross-Validation

  • We are dividing the dataset into k groups (or folds) of approximately equal size.

    • We treat the first group as the validation set.
    • The other k-1 groups are used to construct the model.
    • Then, we compute the MSE on the first group.
  • We repeat this process k times, giving us k estimates of the test error.

  • We construct CV_{(k)} as the average MSE, \text{CV}_{(k)} = \frac{\sum_{i=1}^k \text{MSE}_i}{k}

  • Note that leave-one-out cross-validation is a special case of k-fold cross-validation, where k=n.
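The fold-splitting logic above can be sketched as follows (Python, synthetic data): shuffle the indices, cut them into k folds, and for each fold fit on the remaining k - 1 folds and record the held-out MSE.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 10
x = rng.uniform(0, 10, n)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, n)

# Shuffle the indices, then cut them into k folds of roughly equal size.
folds = np.array_split(rng.permutation(n), k)

fold_mses = []
for fold in folds:
    mask = np.ones(n, dtype=bool)
    mask[fold] = False                        # hold this fold out
    coefs = np.polyfit(x[mask], y[mask], 1)   # fit on the other k - 1 folds
    yhat = np.polyval(coefs, x[fold])
    fold_mses.append(np.mean((y[fold] - yhat) ** 2))

# CV_(k): the average of the k held-out MSEs.
cv_k = np.mean(fold_mses)
```

Only k model fits are required instead of n, which is why 10-fold cross-validation is often preferred for large datasets or expensive models.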

k-Fold Cross-Validation

  • Suppose we are interested in 10-fold cross-validation.

  • In our example,

set.seed(75628)
m1 <- glm(body_mass_g ~ flipper_length_mm, data = data)
cv.glm(data, m1, K=10)$delta
[1] 155341.6 155258.9
set.seed(24681)
m2 <- glm(body_mass_g ~ flipper_length_mm + flipper2, data = data)
cv.glm(data, m2, K=10)$delta
[1] 145299.5 145203.5
  • The first value is the standard k-fold estimate of CV_{(k)}; the second is a bias-adjusted version that compensates for using k folds rather than n.

  • The values above indicate that the model with the quadratic term offers a better fit.

Cross-Validation in Classification Problems

  • What if we are working with categorical data and using logistic regression?

  • Instead of calculating CV_{(n)} based on the MSE, we now base it on the misclassification rate: \text{CV}_{(n)} = \frac{\sum_{i=1}^n \text{Err}_i}{n},

  • where Err_i = I(y_i \ne \hat{y}_i) indicates whether observation i was misclassified, so the sum counts the total number of misclassifications.

  • We will again use the glm() function for modeling and then employ the cv.glm() function for cross-validation.

  • Note that by default cv.glm() measures error as the average squared difference between the 0/1 response and the predicted probability; to obtain the misclassification rate instead, supply a cost function such as cost = function(y, pi) mean(abs(y - pi) > 0.5).
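Independently of cv.glm(), the quantity being averaged is simple to compute once predictions are in hand. A small Python sketch (the labels and predicted probabilities below are made up for illustration, not taken from the admissions data): classify with a 0.5 cutoff and average the misclassification indicators Err_i.

```python
import numpy as np

# Hypothetical held-out labels and predicted probabilities from a
# logistic-type classifier (illustrative values, no model is fit here).
y = np.array([0, 1, 1, 0, 1, 0, 0, 1])
p_hat = np.array([0.2, 0.8, 0.4, 0.1, 0.9, 0.6, 0.3, 0.7])

# Classify with a 0.5 cutoff, then Err_i = I(y_i != yhat_i).
yhat = (p_hat >= 0.5).astype(int)
err = (y != yhat).astype(int)

# CV error: the proportion of held-out observations misclassified.
cv_err = err.mean()  # 2 of 8 misclassified -> 0.25
```

The deltas reported by cv.glm() for the admissions models play the same role: smaller values mean fewer (or less severe) prediction errors on held-out data.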

Cross-Validation in Classification Problems

  • Consider the following example: A researcher is interested in how student characteristics affect admission into graduate school. We are interested in modeling the graduate school admission as a function of GRE, college GPA, and prestige of the undergraduate institution.
library(gsheet)
data <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1fCIhZTf4BnE_Xly4zp8Cg_cz4wAYrQN0WPN9vDnqSEE/edit#gid=0")
head(data, n = 3)

Cross-Validation in Classification Problems

  • Let us compare models with/without the prestige of the undergraduate institution predictor using leave-one-out cross-validation:
set.seed(65730)
m1 <- glm(admit ~ gre + gpa + rank, data = data, family = "binomial")
cv.glm(data, m1)$delta
[1] 0.1992860 0.1992809
set.seed(162267)
m2 <- glm(admit ~ gre + gpa, data = data, family = "binomial")
cv.glm(data, m2)$delta
[1] 0.2099091 0.2099051
  • Which model fits better? Why?

Cross-Validation in Classification Problems

  • Let us repeat the last example under 10-fold cross-validation:
set.seed(999)
cv.glm(data, m1, K=10)$delta
[1] 0.1991359 0.1989378
set.seed(7816)
cv.glm(data, m2, K=10)$delta
[1] 0.2104978 0.2103004
  • Which model fits better? Why?

Conclusion

  • Cross-validation is another tool in our toolbox for quantifying the error in our models.

    • When we use it to compare several candidate models, we can quantify how much each candidate reduces the prediction error.
  • While our examples showed that specific models fit better, it could be the case that the reduction is very small.

    • It may not be “worth” a more complicated model for a small reduction in error.